Google Play Store Analysis

This is an exploratory data analysis project that examines how variables such as the size or price of an app affect its rating and the number of reviews or installs it receives. After exploring the relationships between variables, I will test some questions about the data for statistical significance and build a linear regression model that predicts the number of installs an app will receive from the variables available in the dataset.

Data Cleaning

library(psych) 
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(gmodels) 
library(MASS)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
library(cluster)  
library(fpc)  
library(corrplot)
## corrplot 0.84 loaded
library(FactoMineR)
library(scatterplot3d) 
library(readr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(tidyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(mosaicCore)
## 
## Attaching package: 'mosaicCore'
## The following objects are masked from 'package:dplyr':
## 
##     count, tally
## The following object is masked from 'package:car':
## 
##     logit
## The following object is masked from 'package:psych':
## 
##     logit
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.3     ✔ stringr 1.4.0
## ✔ purrr   0.3.2     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()           masks psych::%+%()
## ✖ ggplot2::alpha()         masks psych::alpha()
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ mosaicCore::count()      masks dplyr::count()
## ✖ lubridate::date()        masks base::date()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ lubridate::intersect()   masks base::intersect()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ dplyr::recode()          masks car::recode()
## ✖ dplyr::select()          masks MASS::select()
## ✖ lubridate::setdiff()     masks base::setdiff()
## ✖ purrr::some()            masks car::some()
## ✖ Hmisc::src()             masks dplyr::src()
## ✖ Hmisc::summarize()       masks dplyr::summarize()
## ✖ mosaicCore::tally()      masks dplyr::tally()
## ✖ lubridate::union()       masks base::union()
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## The following object is masked from 'package:survival':
## 
##     cluster
library(nnet)
library(plotrix)
## 
## Attaching package: 'plotrix'
## The following object is masked from 'package:gplots':
## 
##     plotCI
## The following object is masked from 'package:psych':
## 
##     rescale
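# stringsAsFactors = TRUE keeps the factor columns shown below on R >= 4.0, where the default changed to FALSE
g_apps <- read.csv("Google-Playstore-Full.csv", header = TRUE, stringsAsFactors = TRUE)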
head(g_apps)
##                                             App.Name         Category
## 1                           DoorDash - Food Delivery   FOOD_AND_DRINK
## 2 TripAdvisor Hotels Flights Restaurants Attractions TRAVEL_AND_LOCAL
## 3                                             Peapod         SHOPPING
## 4                    foodpanda - Local Food Delivery   FOOD_AND_DRINK
## 5                          My CookBook Pro (Ad Free)   FOOD_AND_DRINK
## 6                            Safeway Online Shopping   FOOD_AND_DRINK
##        Rating Reviews     Installs               Size Price Content.Rating
## 1 4.548561573  305034   5,000,000+ Varies with device     0       Everyone
## 2 4.400671482 1207922 100,000,000+ Varies with device     0       Everyone
## 3 3.656329393    1967     100,000+               1.4M     0       Everyone
## 4 4.107232571  389154  10,000,000+                16M     0       Everyone
## 5 4.647752285    2291      10,000+ Varies with device $5.99       Everyone
## 6  3.82532239    2559     100,000+                23M     0       Everyone
##         Last.Updated    Minimum.Version     Latest.Version X X.1 X.2 X.3
## 1     March 29, 2019 Varies with device Varies with device            NA
## 2     March 29, 2019 Varies with device Varies with device            NA
## 3 September 20, 2018         5.0 and up              2.2.0            NA
## 4     March 22, 2019         4.2 and up             4.18.2            NA
## 5      April 1, 2019 Varies with device Varies with device            NA
## 6     March 29, 2019         5.0 and up              7.6.0            NA
str(g_apps)
## 'data.frame':    267052 obs. of  15 variables:
##  $ App.Name       : Factor w/ 244406 levels "_PRISM","--SB Kiosk App--",..: 74410 223974 168267 91440 153012 191111 241995 213156 86853 29047 ...
##  $ Category       : Factor w/ 68 levels ""," Accounting",..: 30 66 61 30 30 30 66 30 66 30 ...
##  $ Rating         : Factor w/ 99856 levels " Economics"," Lessons",..: 75011 58414 11903 30767 85185 16562 44402 69013 16127 74642 ...
##  $ Reviews        : Factor w/ 24545 levels "","1","10","100",..: 11780 2011 7233 14352 8799 9912 15192 3790 22757 20222 ...
##  $ Installs       : Factor w/ 38 levels " Xmax X","0+",..: 23 12 13 9 10 13 9 23 10 23 ...
##  $ Size           : Factor w/ 1248 levels "1,000,000+","1,000+",..: 1248 1248 31 153 1248 248 1248 1248 454 1248 ...
##  $ Price          : Factor w/ 504 levels "$0.56","$0.67",..: 488 488 488 488 398 488 488 488 488 488 ...
##  $ Content.Rating : Factor w/ 12 levels "$0.99","$2.49",..: 8 8 8 8 8 8 11 8 8 8 ...
##  $ Last.Updated   : Factor w/ 2751 levels "0","500,000+",..: 1761 1761 2621 1706 12 1761 1753 1786 698 1820 ...
##  $ Minimum.Version: Factor w/ 100 levels "","0","1.0 - 6.0",..: 100 100 80 70 100 80 100 100 72 100 ...
##  $ Latest.Version : Factor w/ 22994 levels ""," 1.0.1.6",..: 22890 22890 11185 16974 22890 20210 22890 22890 11936 22890 ...
##  $ X              : Factor w/ 16 levels "","1.0.0","1.0.1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ X.1            : Factor w/ 4 levels "","1","4.4 and up",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ X.2            : Factor w/ 3 levels "","4.4 and up",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ X.3            : num  NA NA NA NA NA NA NA NA NA NA ...
summary(g_apps)
##      App.Name                     Category              Rating      
##  ????    :   766   EDUCATION          : 33394   5          : 23804  
##  ?????   :   635   TOOLS              : 21592   4          :  5469  
##  ??????  :   608   BOOKS_AND_REFERENCE: 21377   4.5        :  3519  
##  ??????? :   415   ENTERTAINMENT      : 20604   3          :  2581  
##  ????????:   334   MUSIC_AND_AUDIO    : 17876   4.333333492:  2204  
##  (Other) :264293   LIFESTYLE          : 15034   4.666666508:  2167  
##  NA's    :     1   (Other)            :137175   (Other)    :227308  
##     Reviews           Installs                     Size       
##  1      :  9203   10,000+ :60531   Varies with device: 11726  
##  2      :  7581   1,000+  :48880   11M               :  7312  
##  3      :  6445   100,000+:37498   12M               :  6362  
##  4      :  5624   5,000+  :26360   13M               :  5569  
##  5      :  4962   50,000+ :22795   14M               :  5266  
##  6      :  4479   100+    :18502   15M               :  5157  
##  (Other):228758   (Other) :52486   (Other)           :225660  
##      Price             Content.Rating           Last.Updated   
##  0      :255428   Everyone    :241578   April 2, 2019 :  4000  
##  $0.99  :  2317   Teen        : 17261   April 1, 2019 :  3331  
##  $1.99  :  1552   Everyone 10+:  4661   March 28, 2019:  2736  
##  $2.99  :  1351   Mature 17+  :  3489   March 25, 2019:  2681  
##  $4.99  :   883   Unrated     :    33   March 29, 2019:  2670  
##  $3.99  :   767   0           :    12   March 26, 2019:  2650  
##  (Other):  4754   (Other)     :    18   (Other)       :248984  
##      Minimum.Version             Latest.Version  
##  4.1 and up  :70848   1                 : 33002  
##  4.0.3 and up:49324   1.1               : 11714  
##  4.0 and up  :37837   Varies with device:  8555  
##  4.4 and up  :28250   1.2               :  8205  
##  5.0 and up  :17413   2                 :  7126  
##  4.2 and up  :13629   1.3               :  5922  
##  (Other)     :49751   (Other)           :192528  
##                   X                         X.1        
##                    :267034                    :267049  
##  1.0.0             :     2   1                :     1  
##  1.0.1             :     2   4.4 and up       :     1  
##  Varies with device:     2   February 14, 2019:     1  
##  1.1               :     1                             
##  1.15.1            :     1                             
##  (Other)           :    10                             
##          X.2              X.3        
##            :267050   Min.   :9.1     
##  4.4 and up:     1   1st Qu.:9.1     
##  9.0.3     :     1   Median :9.1     
##                      Mean   :9.1     
##                      3rd Qu.:9.1     
##                      Max.   :9.1     
##                      NA's   :267051

It looks like some rows are misaligned: their values have spilled over into columns 12 through 15 (X, X.1, X.2 and X.3), which are blank for every other row. I’ll remove those rows and then drop the four columns.

# Removing misaligned rows: column X is blank for every well-formed row,
# so any non-blank value there marks a row whose fields have shifted right
misaligned <- which(as.character(g_apps$X) != "")
g_apps1 <- g_apps[-misaligned, ]
head(g_apps1) # Columns X - X.3 are now empty, let's drop those columns from the dataset
##                                             App.Name         Category
## 1                           DoorDash - Food Delivery   FOOD_AND_DRINK
## 2 TripAdvisor Hotels Flights Restaurants Attractions TRAVEL_AND_LOCAL
## 3                                             Peapod         SHOPPING
## 4                    foodpanda - Local Food Delivery   FOOD_AND_DRINK
## 5                          My CookBook Pro (Ad Free)   FOOD_AND_DRINK
## 6                            Safeway Online Shopping   FOOD_AND_DRINK
##        Rating Reviews     Installs               Size Price Content.Rating
## 1 4.548561573  305034   5,000,000+ Varies with device     0       Everyone
## 2 4.400671482 1207922 100,000,000+ Varies with device     0       Everyone
## 3 3.656329393    1967     100,000+               1.4M     0       Everyone
## 4 4.107232571  389154  10,000,000+                16M     0       Everyone
## 5 4.647752285    2291      10,000+ Varies with device $5.99       Everyone
## 6  3.82532239    2559     100,000+                23M     0       Everyone
##         Last.Updated    Minimum.Version     Latest.Version X X.1 X.2 X.3
## 1     March 29, 2019 Varies with device Varies with device            NA
## 2     March 29, 2019 Varies with device Varies with device            NA
## 3 September 20, 2018         5.0 and up              2.2.0            NA
## 4     March 22, 2019         4.2 and up             4.18.2            NA
## 5      April 1, 2019 Varies with device Varies with device            NA
## 6     March 29, 2019         5.0 and up              7.6.0            NA
g_apps1 <- g_apps1[,-c(12, 13, 14,15)]
summary(g_apps1)
##      App.Name                     Category              Rating      
##  ????    :   766   EDUCATION          : 33394   5          : 23804  
##  ?????   :   635   TOOLS              : 21592   4          :  5469  
##  ??????  :   608   BOOKS_AND_REFERENCE: 21377   4.5        :  3519  
##  ??????? :   415   ENTERTAINMENT      : 20604   3          :  2581  
##  ????????:   334   MUSIC_AND_AUDIO    : 17876   4.333333492:  2204  
##  (Other) :264275   LIFESTYLE          : 15034   4.666666508:  2167  
##  NA's    :     1   (Other)            :137157   (Other)    :227290  
##     Reviews           Installs                     Size       
##  1      :  9203   10,000+ :60531   Varies with device: 11726  
##  2      :  7581   1,000+  :48880   11M               :  7312  
##  3      :  6445   100,000+:37498   12M               :  6362  
##  4      :  5622   5,000+  :26360   13M               :  5569  
##  5      :  4960   50,000+ :22795   14M               :  5266  
##  6      :  4479   100+    :18502   15M               :  5157  
##  (Other):228744   (Other) :52468   (Other)           :225642  
##      Price                Content.Rating           Last.Updated   
##  0      :255428   Everyone       :241578   April 2, 2019 :  4000  
##  $0.99  :  2317   Teen           : 17261   April 1, 2019 :  3331  
##  $1.99  :  1552   Everyone 10+   :  4661   March 28, 2019:  2736  
##  $2.99  :  1351   Mature 17+     :  3489   March 25, 2019:  2681  
##  $4.99  :   883   Unrated        :    33   March 29, 2019:  2670  
##  $3.99  :   767   Adults only 18+:    12   March 26, 2019:  2650  
##  (Other):  4736   (Other)        :     0   (Other)       :248966  
##      Minimum.Version             Latest.Version  
##  4.1 and up  :70848   1                 : 33002  
##  4.0.3 and up:49324   1.1               : 11714  
##  4.0 and up  :37837   Varies with device:  8553  
##  4.4 and up  :28250   1.2               :  8205  
##  5.0 and up  :17413   2                 :  7126  
##  4.2 and up  :13629   1.3               :  5922  
##  (Other)     :49733   (Other)           :192512

Formatting the data for analysis.

g_apps2 <- g_apps1
# Removing k and M from Size variable
g_apps2$Size <- as.character(g_apps2$Size)
g_apps2$Size <- gsub("\\.5k", "500", g_apps2$Size)
g_apps2$Size <- gsub("\\.1M", "100000", g_apps2$Size)
g_apps2$Size <- gsub("\\.2M", "200000", g_apps2$Size)
g_apps2$Size <- gsub("\\.3M", "300000", g_apps2$Size)
g_apps2$Size <- gsub("\\.4M", "400000", g_apps2$Size)
g_apps2$Size <- gsub("\\.5M", "500000", g_apps2$Size)
g_apps2$Size <- gsub("\\.6M", "600000", g_apps2$Size)  
g_apps2$Size <- gsub("\\.7M", "700000", g_apps2$Size)
g_apps2$Size <- gsub("\\.8M", "800000", g_apps2$Size)
g_apps2$Size <- gsub("\\.9M", "900000", g_apps2$Size)
g_apps2$Size <- gsub("\\.0M","000000",g_apps2$Size)
g_apps2$Size <- gsub("\\M", "000000", g_apps2$Size)
g_apps2$Size <- gsub("\\k", "000", g_apps2$Size)
g_apps2$Size <- gsub("\\,", "", g_apps2$Size)
g_apps2$Size <- gsub("\\+", "", g_apps2$Size)
g_apps2$Size <- as.factor(g_apps2$Size)

# Removing + and , from Installs 
g_apps2$Installs <- gsub("\\+", "", g_apps2$Installs)
g_apps2$Installs <- gsub(",", "", g_apps2$Installs)
g_apps2$Installs <- as.factor(g_apps2$Installs)

# Remove the $ from Price
g_apps2$Price <- gsub("\\$", "", g_apps2$Price)
g_apps2$Price <- as.factor(g_apps2$Price)

# Change last updated to date format
g_apps2$Last.Updated <- gsub("January", "1-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("February", "2-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("March", "3-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("April", "4-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("May", "5-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("June", "6-", g_apps2$Last.Updated)  
g_apps2$Last.Updated <- gsub("July", "7-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("August", "8-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("September", "9-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("October", "10-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("November", "11-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub("December", "12-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub(",", "-", g_apps2$Last.Updated)
g_apps2$Last.Updated <- gsub(" ", "", g_apps2$Last.Updated)

g_apps2$Last.Updated <- strptime(g_apps2$Last.Updated,format="%m-%d-%Y")
g_apps2$Last.Updated <- as.Date(g_apps2$Last.Updated) # strptime() has already parsed the text, so no format is needed here

# Remove "and up" from Minimum Version
g_apps2$Minimum.Version <- gsub(" and up", "+", g_apps2$Minimum.Version)
g_apps2$Minimum.Version <- as.factor(g_apps2$Minimum.Version)

summary(g_apps2)
##      App.Name                     Category              Rating      
##  ????    :   766   EDUCATION          : 33394   5          : 23804  
##  ?????   :   635   TOOLS              : 21592   4          :  5469  
##  ??????  :   608   BOOKS_AND_REFERENCE: 21377   4.5        :  3519  
##  ??????? :   415   ENTERTAINMENT      : 20604   3          :  2581  
##  ????????:   334   MUSIC_AND_AUDIO    : 17876   4.333333492:  2204  
##  (Other) :264275   LIFESTYLE          : 15034   4.666666508:  2167  
##  NA's    :     1   (Other)            :137157   (Other)    :227290  
##     Reviews          Installs                     Size       
##  1      :  9203   10000  :60531   Varies with device: 11726  
##  2      :  7581   1000   :48880   11000000          :  7312  
##  3      :  6445   100000 :37498   12000000          :  6362  
##  4      :  5622   5000   :26360   13000000          :  5569  
##  5      :  4960   50000  :22795   14000000          :  5266  
##  6      :  4479   100    :18502   15000000          :  5157  
##  (Other):228744   (Other):52468   (Other)           :225642  
##      Price                Content.Rating    Last.Updated       
##  0      :255428   Everyone       :241578   Min.   :2009-02-11  
##  0.99   :  2317   Teen           : 17261   1st Qu.:2018-04-30  
##  1.99   :  1552   Everyone 10+   :  4661   Median :2018-11-20  
##  2.99   :  1351   Mature 17+     :  3489   Mean   :2018-06-22  
##  4.99   :   883   Unrated        :    33   3rd Qu.:2019-02-22  
##  3.99   :   767   Adults only 18+:    12   Max.   :2019-04-04  
##  (Other):  4736   (Other)        :     0                       
##  Minimum.Version            Latest.Version  
##  4.1+   :70848   1                 : 33002  
##  4.0.3+ :49324   1.1               : 11714  
##  4.0+   :37837   Varies with device:  8553  
##  4.4+   :28250   1.2               :  8205  
##  5.0+   :17413   2                 :  7126  
##  4.2+   :13629   1.3               :  5922  
##  (Other):49733   (Other)           :192512
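
The chains of gsub() calls above for Size and Last.Updated could be written more compactly. The following is only a sketch under stated assumptions: parse_size() is a hypothetical helper that is not part of the original analysis, the input values are toy strings in the formats shown by head(g_apps), and the date parsing assumes an English locale so that %B matches month names such as "March". Unlike the code above, the helper returns NA for "Varies with device" instead of keeping it as a level.

```r
# Hypothetical helper (not part of the original analysis): convert size strings
# such as "1.4M" or "853k" to a number, reading k as 1e3 and M as 1e6.
# Anything without a k/M suffix, e.g. "Varies with device", becomes NA.
parse_size <- function(x) {
  x <- trimws(as.character(x))
  unit <- substr(x, nchar(x), nchar(x))      # last character of each string
  value <- suppressWarnings(as.numeric(substr(x, 1, nchar(x) - 1)))
  multiplier <- ifelse(unit == "M", 1e6, ifelse(unit == "k", 1e3, NA_real_))
  value * multiplier
}

# Toy values; "853k" is made up for illustration
sizes <- c("1.4M", "16M", "853k", "Varies with device")
parse_size(sizes)

# Assumes an English locale so that %B recognises the month names
updated <- c("March 29, 2019", "September 20, 2018")
as.Date(updated, format = "%B %d, %Y")
```

Working with a multiplier avoids one gsub() pattern per decimal digit, and it also handles fractional sizes such as "2.3k" that the fixed patterns would not cover.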

Univariate Analysis

Now I’ll take a look at each variable and check whether it is approximately normally distributed, since the tests I’ll be performing assume normality.

g_apps2$Rating <- as.numeric(as.character(g_apps2$Rating))
g_apps2$Reviews <- as.numeric(as.character(g_apps2$Reviews))
g_apps3 <- g_apps2

# -------- Rating ----------#
rating <- g_apps3$Rating
hist(rating, breaks = 100, col="#00DCFF")

summary(rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.018   4.382   4.269   4.649   5.000
len_rate <- length(which(rating>=3))
len_rateP <- len_rate/(length(rating))
len_rateP # about 97% of apps are rated 3 or higher
## [1] 0.9695095
# Look for normal distribution
par(mfrow=c(3,1))
plot(density(rating, na.rm = T))
plot(density(log(rating), na.rm = T))
plot(density(sqrt(rating), na.rm = T)) 

low_outliers <- fivenum(rating, na.rm = T)[2]-IQR(rating, na.rm = T)*1.5
low_outliers # 3.07
## [1] 3.071228
low <- which(rating < low_outliers)
length(low) # 11627 low outliers, as expected for left-skewed data; removing them to get closer to a normal distribution
## [1] 11627
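
The cutoff above is the usual 1.5 × IQR rule. As a minimal sketch of that rule, assuming a hypothetical helper tukey_fences() and made-up ratings (neither is part of the original analysis), and using quantile() rather than fivenum() for the quartiles, which can differ slightly on small samples:

```r
# Hypothetical helper: lower and upper fences at k * IQR beyond the quartiles
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(1, 4, 4.2, 4.4, 4.5, 4.6, 4.8, 5)   # made-up ratings
fences <- tukey_fences(x)
fences

# Values outside the fences are flagged as outliers;
# here only the rating of 1 falls below the lower fence
x[x < fences["lower"] | x > fences["upper"]]
```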

We can see that all transformations of the data are still skewed, so I’ll try removing the low outliers.

g_apps4 <- g_apps3[-low,]
par(mfrow=c(1,1))
hist(g_apps4$Rating, col = c("#00DCFF"), main = "Distribution of Ratings", breaks = 100)

plot(density(g_apps4$Rating, na.rm = T), col = "#444444")

plot(density(log(g_apps4$Rating), na.rm = T), col="#444444", lwd=2.5)

plot(density(sqrt(g_apps4$Rating), na.rm = T), col="#444444", lwd=2.5)

hist(g_apps4$Rating, col=c("#00DCFF"), xlim=c(3,5), freq = F, breaks=seq(.5, 20, 0.1),
     xlab = "App Rating", main = "Distribution of App Ratings")
rug(jitter(g_apps4$Rating), col="#444444")
lines(density(g_apps4$Rating), col="#FFD800", lwd=2)
box()

boxplot(g_apps4$Rating, main="Ratings", col="#00DCFF")

The ratings are still left skewed, but I will leave them as they are.

# --------- Category --------# 
cat <- table(g_apps4$Category)
sortcat <- sort(cat, decreasing = TRUE)
pcat <- round(prop.table(cat), 2)
pcat <- sort(pcat, decreasing = TRUE)
pcat
## 
##                        EDUCATION              BOOKS_AND_REFERENCE 
##                             0.13                             0.08 
##                    ENTERTAINMENT                            TOOLS 
##                             0.08                             0.08 
##                  MUSIC_AND_AUDIO                        LIFESTYLE 
##                             0.07                             0.06 
##                         BUSINESS                          FINANCE 
##                             0.04                             0.04 
##                  PERSONALIZATION               HEALTH_AND_FITNESS 
##                             0.04                             0.03 
##               NEWS_AND_MAGAZINES                      PHOTOGRAPHY 
##                             0.03                             0.03 
##                     PRODUCTIVITY                    COMMUNICATION 
##                             0.03                             0.02 
##                         SHOPPING                           SOCIAL 
##                             0.02                             0.02 
##                           SPORTS                 TRAVEL_AND_LOCAL 
##                             0.02                             0.02 
##                   ART_AND_DESIGN                AUTO_AND_VEHICLES 
##                             0.01                             0.01 
##                   FOOD_AND_DRINK                      GAME_ACTION 
##                             0.01                             0.01 
##                      GAME_ARCADE                      GAME_CASUAL 
##                             0.01                             0.01 
##                 GAME_EDUCATIONAL                      GAME_PUZZLE 
##                             0.01                             0.01 
##                  GAME_SIMULATION              MAPS_AND_NAVIGATION 
##                             0.01                             0.01 
##                          MEDICAL                    VIDEO_PLAYERS 
##                             0.01                             0.01 
##                          WEATHER                                  
##                             0.01                             0.00 
##                       Accounting                     Alfabe �?ren 
##                             0.00                             0.00 
##                    Breaking News                   Channel 2 News 
##                             0.00                             0.00 
##                     ETEA & MDCAT  Islamic Name Boy & Girl+Meaning 
##                             0.00                             0.00 
##                         Mexpost)        not notified you follow - 
##                             0.00                             0.00 
##                         Podcasts   Romantic Song Music Love Songs 
##                             0.00                             0.00 
##                 Speaker Pro 2019       super loud speaker booster 
##                             0.00                             0.00 
##                       Tour Guide                    T�rk Alfabesi 
##                             0.00                             0.00 
##                                )                                6 
##                             0.00                             0.00 
##                           BEAUTY                           COMICS 
##                             0.00                             0.00 
##                           DATING                           EVENTS 
##                             0.00                             0.00 
##                   GAME_ADVENTURE                       GAME_BOARD 
##                             0.00                             0.00 
##                        GAME_CARD                      GAME_CASINO 
##                             0.00                             0.00 
##                       GAME_MUSIC                      GAME_RACING 
##                             0.00                             0.00 
##                GAME_ROLE_PLAYING                      GAME_SPORTS 
##                             0.00                             0.00 
##                    GAME_STRATEGY                      GAME_TRIVIA 
##                             0.00                             0.00 
##                        GAME_WORD                       Gate ALARM 
##                             0.00                             0.00 
##                   HOUSE_AND_HOME               LIBRARIES_AND_DEMO 
##                             0.00                             0.00 
##                        PARENTING                           TRAVEL 
##                             0.00                             0.00

There are too many categories, so I’ll group them into four broader ones for a more even spread.

g_apps5 <- g_apps4
g_apps5$Category <- as.character(g_apps5$Category)

# Original categories (including stray values left over from misaligned rows) assigned to each of the four groups
education <- c("BUSINESS", "FINANCE", "ETEA & MDCAT", "PARENTING",
               "BOOKS_AND_REFERENCE", "LIBRARIES_AND_DEMO")

lifestyle <- c("HEALTH_AND_FITNESS", "FOOD_AND_DRINK", "MEDICAL", "BEAUTY", "HEALTH",
               "NEWS_AND_MAGAZINES", "SHOPPING", "TRAVEL_AND_LOCAL", "ART_AND_DESIGN",
               "AUTO_AND_VEHICLES", "Mexpost)", "not notified you follow -", ")", "6",
               "Tour Guide", "TRAVEL", "Breaking News", "Channel 2 News",
               "DATING", "EVENTS", "SOCIAL")

productivity <- c("PHOTOGRAPHY", "COMMUNICATION", "MAPS_AND_NAVIGATION", "VIDEO_PLAYERS",
                  "WEATHER", "HOUSE_AND_HOME", "Gate ALARM",
                  "Islamic Name Boy & Girl+Meaning", "PERSONALIZATION", "TOOLS")

entertainment <- c("MUSIC_AND_AUDIO", "Romantic Song Music Love Songs", "Speaker Pro 2019",
                   "super loud speaker booster", "SPORTS", "COMICS", "Podcasts", "MUSIC",
                   "GAME_ACTION", "GAME_ARCADE", "GAME_CASUAL", "GAME_EDUCATIONAL",
                   "GAME_PUZZLE", "GAME_SIMULATION", "GAME_ADVENTURE", "GAME_BOARD",
                   "GAME_CARD", "GAME_CASINO", "GAME_MUSIC", "GAME_RACING",
                   "GAME_ROLE_PLAYING", "GAME_SPORTS", "GAME_STRATEGY", "GAME_TRIVIA",
                   "GAME_WORD")

g_apps5$Category[g_apps5$Category %in% education] <- "EDUCATION"
g_apps5$Category[g_apps5$Category %in% lifestyle] <- "LIFESTYLE"
g_apps5$Category[g_apps5$Category %in% productivity] <- "PRODUCTIVITY"
g_apps5$Category[g_apps5$Category %in% entertainment] <- "ENTERTAINMENT"

# The two Turkish category names are mis-encoded in the source file, so match them on the intact "Alfabe" substring
g_apps5$Category[grepl("Alfabe", g_apps5$Category, fixed = TRUE, useBytes = TRUE)] <- "LIFESTYLE"

g_apps5$Category <- as.factor(g_apps5$Category)

ct <- table(g_apps5$Category)
pct <- prop.table(ct)
pct <- round(pct, 2)
pct
## 
##     EDUCATION ENTERTAINMENT     LIFESTYLE  PRODUCTIVITY 
##          0.29          0.25          0.22          0.24
par(mfrow=c(1,1))
barplot(pct, main = "Proportion of Apps by Category", col = c("#00DCFF", "#F83648", "#FFD800", "#04F075"), ylim = c(0,0.5), xlab = "Category of App", ylab = "Proportion of Apps")
legend("topright", fill = c("#00DCFF", "#F83648", "#FFD800", "#04F075"), legend = levels(g_apps5$Category))
box()

g_apps6 <- g_apps5

str(g_apps6$Reviews)
##  num [1:255407] 305034 1207922 1967 389154 2291 ...
summary(g_apps6$Reviews)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        1       18      101    15246      708 86214292
hist(g_apps6$Reviews, breaks = 10, col = "#04F075")

plot(density(g_apps6$Reviews)) # Nowhere near normal distribution

boxplot(g_apps6$Reviews, col = "#00DCFF")

options(scipen = 99) 

par(mfrow=c(1,1))
plot(density(log(g_apps6$Reviews)))

hist(log(g_apps6$Reviews), col = "#FFD800", breaks = 100) # Way better, but right skewed a bit

# Removing outliers
log_reviews <- log(g_apps6$Reviews)
revup_limit <- fivenum(log_reviews, na.rm=T)[4]+IQR(log_reviews)*1.5
revup_outliers <- which(log_reviews > revup_limit)
length(revup_outliers) # 2796 outliers
## [1] 2796
revlow_limit <- fivenum(log_reviews)[2]-IQR(log_reviews)*1.5
revlow_outliers <- which(log_reviews < revlow_limit)
length(revlow_outliers) # 0 outliers
## [1] 0
g_apps6$Reviews <- log(g_apps6$Reviews) # Using the log transformation because, although it is still right skewed, it is closer to a normal distribution
g_apps7 <- g_apps6[-revup_outliers,]
plot(density(g_apps7$Reviews), main = "Density Plot log Reviews", lwd = 3, xlab = "Log of Reviews", col="#00DCFF")
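
The outlier step above follows the usual 1.5 x IQR rule on the log scale. A minimal sketch of the same idea as a reusable function, run on simulated data; `tukey_outliers`, `toy_reviews` and `cleaned` are hypothetical names, and `quantile()` is used here in place of `fivenum()`, so the cut-offs can differ slightly from the hinges used above.

```r
# Hypothetical helper: indices of values outside quartile -/+ k * IQR
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  which(x < q[1] - k * spread | x > q[2] + k * spread)
}

set.seed(1)
toy_reviews <- c(rlnorm(200, meanlog = 4, sdlog = 1), 1e9)  # one extreme value appended
idx <- tukey_outliers(log(toy_reviews))

# Guard against the empty case: x[-integer(0)] would drop every element
cleaned <- if (length(idx) > 0) toy_reviews[-idx] else toy_reviews
length(cleaned)
```

The guard matters because negative indexing with an empty vector returns nothing at all, which would silently empty the data if no outliers were found.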

#-------------- Installs ---------------#
summary(g_apps7$Installs)
##          0          1         10        100       1000      10000 
##         45        426       4069      16533      45961      58598 
##     100000    1000000   10000000  100000000 1000000000          5 
##      36736      12599       1221         22          0        640 
##         50        500       5000      50000     500000    5000000 
##       3486      12693      25268      22184       9685       2414 
##   50000000  500000000 5000000000 
##         26          5          0
summary(as.numeric(as.character(g_apps7$Installs)))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##         0      1000     10000    210881     50000 500000000
g_apps7$Installs <- as.numeric(as.character(g_apps7$Installs))
hist(g_apps7$Installs, col = "#00DCFF", breaks = 1000)

plot(density(g_apps7$Installs), col = "#00DCFF", lwd=3) 

# Why is the data so skewed?
toohigh <- which(g_apps7$Installs %in% 500000000)
length(toohigh) # Only 5 apps fall in the 500 million installs bucket
## [1] 5
boxplot(g_apps7$Installs)

# Let me try data transformation to see if I can find a more normal distribution.
plot(density(log(g_apps7$Installs)), col = "#04F075", lwd = 3)

plot(density(sqrt(g_apps7$Installs)), col = "#04F075", lwd = 3)

plot(density(g_apps7$Installs^(1/3)), col = "#04F075", lwd = 3) # None of these data transformations achieve normal distribution.

# As a numeric variable, Installs is extremely skewed so I will convert it to a factor with 5 groups.
g_apps7$Installs <- as.character(g_apps7$Installs)
g_apps7$Installs[g_apps7$Installs %in% c(0,1,5,10,50,100,500)] <- "0-500"
g_apps7$Installs[g_apps7$Installs %in% c(1000,5000)] <- "1000-10000"
g_apps7$Installs[g_apps7$Installs %in% c(10000,50000)] <- "10000-100000"
g_apps7$Installs[g_apps7$Installs %in% c(100000,500000)] <- "100000-1000000"
g_apps7$Installs[g_apps7$Installs %in% c(1000000,5000000,10000000,50000000, 100000000,500000000)] <- "1000000+"

g_apps7$Installs <- as.factor(g_apps7$Installs)
summary(g_apps7$Installs)
##          0-500     1000-10000   10000-100000 100000-1000000       1000000+ 
##          37892          71229          80782          46421          16287
install_tab <- table(g_apps7$Installs)
install_ptab <- round(prop.table(install_tab), 2)
install_ptab
## 
##          0-500     1000-10000   10000-100000 100000-1000000       1000000+ 
##           0.15           0.28           0.32           0.18           0.06
barplot(install_ptab, main = "Installs", xlab = "Number of Installs", ylab = "Proportion of Data", ylim = c(0,0.4), col = c("#00DCFF", "#F83648", "#FFD800", "#04F075", "#444444"))
legend("topright", fill = c("#00DCFF", "#F83648", "#FFD800", "#04F075", "#444444"), legend = c("< 1k Installs", "1k-9.9k Installs", "10k-99k Installs", "100k-999k", "1M+ Installs"))
box()
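
The same five install groups could also be built directly from the numeric values with `cut()`. A minimal sketch on toy values; `toy_installs` and `install_groups` are hypothetical names, and the labels follow the legend text above rather than the factor levels used in the analysis.

```r
# Toy install counts using the Play Store's bucketed values (illustrative only)
toy_installs <- c(0, 5, 500, 1000, 5000, 10000, 50000, 500000, 1000000, 500000000)

# right = FALSE makes intervals closed on the left: [0, 1000), [1000, 10000), ...
install_groups <- cut(
  toy_installs,
  breaks = c(0, 1000, 10000, 100000, 1000000, Inf),
  labels = c("< 1k", "1k-9.9k", "10k-99k", "100k-999k", "1M+"),
  right = FALSE
)

table(install_groups)
```

Because `cut()` keeps the levels in break order, the resulting factor plots in ascending order without any further sorting.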

# -------- Size ----------#
g_apps8 <- g_apps7
summary(g_apps8$Size) # This variable cannot be numeric since "Varies with device" is a value
## Varies with device           11000000           12000000 
##              10671               6927               6061 
##           13000000           14000000           15000000 
##               5307               4966               4908 
##           16000000           10000000           17000000 
##               4315               4236               3567 
##           18000000           19000000           20000000 
##               3444               3364               3074 
##           21000000           23000000           22000000 
##               2788               2606               2572 
##           24000000            3800000            3400000 
##               2556               2517               2501 
##            3300000           25000000            3600000 
##               2493               2492               2473 
##            3200000            3700000            3900000 
##               2456               2427               2368 
##            3500000            2900000            4000000 
##               2363               2305               2279 
##            3100000            2800000           26000000 
##               2274               2240               2229 
##            3000000            4100000            2600000 
##               2212               2107               2100 
##            4500000            4300000            2700000 
##               2077               2067               2014 
##            4900000           27000000            4200000 
##               2013               1968               1963 
##            2300000            2500000            2400000 
##               1925               1916               1912 
##            4400000            4600000            4700000 
##               1874               1801               1797 
##            4800000            5000000            5100000 
##               1794               1778               1778 
##           28000000            2200000            5200000 
##               1776               1726               1701 
##           30000000            5800000           29000000 
##               1686               1670               1612 
##            5300000            2000000            6000000 
##               1606               1605               1598 
##           31000000            5500000            5600000 
##               1594               1536               1519 
##            5400000            5900000            5700000 
##               1511               1490               1480 
##            2100000           32000000            6100000 
##               1464               1433               1324 
##            6400000            6200000            6500000 
##               1314               1306               1285 
##            1900000            6300000           33000000 
##               1284               1280               1277 
##           36000000           34000000            1800000 
##               1269               1256               1247 
##            6600000           35000000            7400000 
##               1214               1213               1202 
##            1700000            6900000            7200000 
##               1184               1168               1156 
##           37000000            7000000            6700000 
##               1141               1136               1124 
##            7300000            7100000            6800000 
##               1116               1111               1074 
##            7500000            7600000            1500000 
##               1070               1067               1052 
##           38000000            7900000            7700000 
##               1018               1015               1014 
##            8300000            1400000            8200000 
##               1011               1005               1003 
##            1600000            7800000           39000000 
##                997                974                964 
##            (Other) 
##              46858
# I'll add a new variable for size groupings
vwd <- which(g_apps8$Size == "Varies with device")
g_apps8.1 <- g_apps8[-vwd,]
summary(g_apps8.1$Size)
## 11000000 12000000 13000000 14000000 15000000 16000000 10000000 17000000 
##     6927     6061     5307     4966     4908     4315     4236     3567 
## 18000000 19000000 20000000 21000000 23000000 22000000 24000000  3800000 
##     3444     3364     3074     2788     2606     2572     2556     2517 
##  3400000  3300000 25000000  3600000  3200000  3700000  3900000  3500000 
##     2501     2493     2492     2473     2456     2427     2368     2363 
##  2900000  4000000  3100000  2800000 26000000  3000000  4100000  2600000 
##     2305     2279     2274     2240     2229     2212     2107     2100 
##  4500000  4300000  2700000  4900000 27000000  4200000  2300000  2500000 
##     2077     2067     2014     2013     1968     1963     1925     1916 
##  2400000  4400000  4600000  4700000  4800000  5000000  5100000 28000000 
##     1912     1874     1801     1797     1794     1778     1778     1776 
##  2200000  5200000 30000000  5800000 29000000  5300000  2000000  6000000 
##     1726     1701     1686     1670     1612     1606     1605     1598 
## 31000000  5500000  5600000  5400000  5900000  5700000  2100000 32000000 
##     1594     1536     1519     1511     1490     1480     1464     1433 
##  6100000  6400000  6200000  6500000  1900000  6300000 33000000 36000000 
##     1324     1314     1306     1285     1284     1280     1277     1269 
## 34000000  1800000  6600000 35000000  7400000  1700000  6900000  7200000 
##     1256     1247     1214     1213     1202     1184     1168     1156 
## 37000000  7000000  6700000  7300000  7100000  6800000  7500000  7600000 
##     1141     1136     1124     1116     1111     1074     1070     1067 
##  1500000 38000000  7900000  7700000  8300000  1400000  8200000  1600000 
##     1052     1018     1015     1014     1011     1005     1003      997 
##  7800000 39000000  8000000  (Other) 
##      974      964      948    45910
g_apps8.1$Size <- as.numeric(as.character(g_apps8.1$Size))
str(g_apps8.1$Size)
##  num [1:241940] 1400000 23000000 4100000 39000000 8100000 19000000 30000000 11000000 11000000 23000000 ...
fivenum(g_apps8.1$Size) # minimum value is 3.1, so I'll set "Varies with device" to 1
## [1]         3.1   4000000.0   8300000.0  19000000.0 334000000.0
g_apps8$Size <- as.character(g_apps8$Size)
g_apps8$Size[g_apps8$Size == "Varies with device"] <- 1
g_apps8$Size <- as.numeric(as.character(g_apps8$Size))
g_apps8$Size.Groups <- cut(g_apps8$Size, breaks = c(0,3,4000000,8300000,19000000,334000000), labels = c("Varies with device", "0-4M",  "4M-8.3M", "8.3M-19M", "19M-334M"))
summary(g_apps8$Size.Groups)
## Varies with device               0-4M            4M-8.3M 
##              10671              60671              61008 
##           8.3M-19M           19M-334M 
##              60798              59463
sgtab <- table(g_apps8$Size.Groups)
psgtab <- prop.table(sgtab)
psgtab # groupings look proportionate aside from the first group, "Varies with device"
## 
## Varies with device               0-4M            4M-8.3M 
##         0.04224282         0.24017561         0.24150967 
##           8.3M-19M           19M-334M 
##         0.24067836         0.23539355
barplot(psgtab, main = "App Size Groupings", col = c("#00DCFF", "#F83648", "#FFD800", "#04F075", "#444444"), xlab = "Size Group", ylab = "Proportion", ylim = c(0,0.3))
legend("topleft", fill = c("#00DCFF", "#F83648", "#FFD800", "#04F075", "#444444"), legend = levels(g_apps8$Size.Groups))
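
An alternative to the sentinel value of 1 is to keep "Varies with device" as a missing size and turn it into its own factor level after binning. A minimal sketch on toy data, assuming quartile breaks are wanted; `toy_size`, `size_num`, `brks` and `size_group` are hypothetical names.

```r
# Toy sizes in bytes; "Varies with device" is the non-numeric value seen in the data
toy_size <- c("1400000", "23000000", "Varies with device", "4100000", "39000000", "8100000")

# Convert only the numeric entries so no coercion warning is raised
is_varies <- toy_size == "Varies with device"
size_num <- rep(NA_real_, length(toy_size))
size_num[!is_varies] <- as.numeric(toy_size[!is_varies])

# Quartile breaks computed from the known sizes only
brks <- quantile(size_num, probs = seq(0, 1, 0.25), na.rm = TRUE)
size_group <- cut(size_num, breaks = brks, include.lowest = TRUE, dig.lab = 10)

# Turn NA into an explicit factor level instead of using a sentinel number
size_group <- addNA(size_group)
levels(size_group)[is.na(levels(size_group))] <- "Varies with device"

table(size_group)
```

This keeps the numeric column free of artificial values, so summaries such as `fivenum()` are not affected by the placeholder.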

g_apps9 <- g_apps8
g_apps9$Price <- as.numeric(as.character(g_apps9$Price)) 
summary(g_apps9$Price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.2309   0.0000 399.9900
pt <- table(as.factor(g_apps9$Price))
ppt <- prop.table(pt)
ppt # over 95% of apps in this dataset cost $0 to download (Free), I'll convert to a factor
## 
##              0           0.67           0.99              1           1.01 
## 0.955876822466 0.000003958656 0.008685290823 0.000178139511 0.000011875967 
##           1.02           1.03           1.04           1.05           1.06 
## 0.000003958656 0.000007917312 0.000003958656 0.000011875967 0.000003958656 
##           1.07           1.08           1.09            1.1           1.11 
## 0.000003958656 0.000011875967 0.000019793279 0.000015834623 0.000015834623 
##           1.12           1.13           1.17           1.18           1.19 
## 0.000003958656 0.000003958656 0.000007917312 0.000003958656 0.000059379837 
##            1.2           1.21           1.22           1.23           1.25 
## 0.000035627902 0.000003958656 0.000003958656 0.000003958656 0.000019793279 
##           1.26           1.27           1.28           1.29            1.3 
## 0.000007917312 0.000003958656 0.000003958656 0.000063338493 0.000015834623 
##           1.32           1.33           1.34           1.35           1.36 
## 0.000011875967 0.000015834623 0.000015834623 0.000007917312 0.000011875967 
##           1.37           1.38           1.39            1.4           1.42 
## 0.000003958656 0.000007917312 0.000015834623 0.000007917312 0.000007917312 
##           1.43           1.44           1.45           1.48           1.49 
## 0.000003958656 0.000003958656 0.000019793279 0.000011875967 0.002858149487 
##            1.5           1.51           1.52           1.53           1.55 
## 0.000051462525 0.000007917312 0.000003958656 0.000003958656 0.000003958656 
##           1.56           1.58           1.59           1.61           1.62 
## 0.000003958656 0.000003958656 0.000039586558 0.000007917312 0.000007917312 
##           1.66           1.67           1.68           1.69            1.7 
## 0.000003958656 0.000003958656 0.000011875967 0.000011875967 0.000023751935 
##           1.72           1.74           1.75           1.77           1.78 
## 0.000003958656 0.000007917312 0.000031669246 0.000011875967 0.000007917312 
##           1.79            1.8           1.81           1.82           1.84 
## 0.000019793279 0.000023751935 0.000011875967 0.000007917312 0.000003958656 
##           1.85           1.86           1.88           1.89            1.9 
## 0.000007917312 0.000007917312 0.000003958656 0.000007917312 0.000011875967 
##           1.92           1.93           1.94           1.95           1.96 
## 0.000007917312 0.000003958656 0.000003958656 0.000035627902 0.000015834623 
##           1.97           1.98           1.99              2           2.02 
## 0.000019793279 0.000003958656 0.005937983698 0.000162304888 0.000003958656 
##           2.03           2.04           2.09            2.1           2.13 
## 0.000011875967 0.000007917312 0.000015834623 0.000007917312 0.000003958656 
##           2.15           2.19            2.2           2.25           2.27 
## 0.000003958656 0.000003958656 0.000007917312 0.000007917312 0.000003958656 
##           2.28           2.29            2.3           2.31           2.32 
## 0.000007917312 0.000015834623 0.000003958656 0.000007917312 0.000007917312 
##           2.33           2.35           2.36           2.37           2.39 
## 0.000003958656 0.000007917312 0.000003958656 0.000003958656 0.000011875967 
##            2.4           2.41           2.42           2.43           2.45 
## 0.000007917312 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##           2.46           2.48           2.49            2.5           2.51 
## 0.000015834623 0.000011875967 0.001983286555 0.000039586558 0.000003958656 
##           2.52           2.54           2.55           2.57           2.59 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000027710591 
##            2.6           2.64           2.65           2.66           2.67 
## 0.000003958656 0.000003958656 0.000007917312 0.000003958656 0.000015834623 
##           2.69           2.71           2.73           2.74           2.77 
## 0.000007917312 0.000007917312 0.000003958656 0.000003958656 0.000011875967 
##           2.79            2.8           2.82           2.84           2.85 
## 0.000015834623 0.000003958656 0.000003958656 0.000007917312 0.000011875967 
##           2.86           2.89            2.9           2.93           2.94 
## 0.000003958656 0.000015834623 0.000031669246 0.000011875967 0.000003958656 
##           2.95           2.98           2.99              3           3.01 
## 0.000003958656 0.000003958656 0.005205632375 0.000083131772 0.000003958656 
##           3.03           3.06           3.08           3.09            3.1 
## 0.000003958656 0.000003958656 0.000023751935 0.000003958656 0.000003958656 
##           3.11           3.14           3.16           3.22           3.24 
## 0.000007917312 0.000003958656 0.000003958656 0.000007917312 0.000003958656 
##           3.25           3.26           3.27           3.28           3.29 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000007917312 
##           3.32           3.33           3.36           3.38           3.39 
## 0.000007917312 0.000003958656 0.000003958656 0.000007917312 0.000007917312 
##            3.4           3.41           3.42           3.43           3.49 
## 0.000003958656 0.000003958656 0.000007917312 0.000003958656 0.001302397758 
##            3.5           3.55           3.58           3.59           3.63 
## 0.000023751935 0.000011875967 0.000003958656 0.000007917312 0.000003958656 
##           3.65           3.72           3.75           3.77           3.78 
## 0.000003958656 0.000003958656 0.000015834623 0.000003958656 0.000003958656 
##            3.8           3.81           3.82           3.83           3.84 
## 0.000003958656 0.000007917312 0.000007917312 0.000007917312 0.000003958656 
##           3.85           3.89            3.9           3.91           3.93 
## 0.000003958656 0.000003958656 0.000019793279 0.000003958656 0.000003958656 
##           3.95           3.97           3.98           3.99              4 
## 0.000027710591 0.000003958656 0.000007917312 0.002945239914 0.000031669246 
##           4.03           4.04           4.06           4.14           4.17 
## 0.000003958656 0.000011875967 0.000003958656 0.000003958656 0.000003958656 
##           4.19            4.2           4.25           4.26           4.29 
## 0.000003958656 0.000007917312 0.000011875967 0.000003958656 0.000007917312 
##            4.3           4.31           4.34           4.35           4.38 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##           4.39            4.4           4.41           4.42           4.44 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##           4.48           4.49            4.5           4.53           4.54 
## 0.000003958656 0.000977787982 0.000007917312 0.000003958656 0.000003958656 
##           4.56           4.57           4.62           4.63           4.64 
## 0.000003958656 0.000003958656 0.000003958656 0.000007917312 0.000003958656 
##           4.69           4.71           4.72           4.73           4.74 
## 0.000011875967 0.000003958656 0.000007917312 0.000003958656 0.000003958656 
##           4.77            4.8           4.82           4.85           4.88 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##           4.89            4.9           4.95           4.97           4.98 
## 0.000011875967 0.000015834623 0.000019793279 0.000003958656 0.000003958656 
##           4.99              5           5.01           5.29            5.3 
## 0.003388609364 0.000047503870 0.000003958656 0.000007917312 0.000003958656 
##           5.33           5.36           5.38            5.4           5.48 
## 0.000007917312 0.000003958656 0.000003958656 0.000003958656 0.000015834623 
##           5.49            5.5           5.55           5.57           5.69 
## 0.000577963747 0.000003958656 0.000007917312 0.000003958656 0.000003958656 
##           5.72           5.74           5.76           5.78           5.79 
## 0.000003958656 0.000007917312 0.000003958656 0.000003958656 0.000003958656 
##            5.9           5.95           5.96           5.98           5.99 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000866945620 
##              6           6.04           6.14           6.15           6.16 
## 0.000023751935 0.000003958656 0.000007917312 0.000003958656 0.000003958656 
##           6.21           6.27           6.28           6.29            6.3 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##           6.49           6.52           6.54           6.57           6.58 
## 0.000249395315 0.000003958656 0.000015834623 0.000011875967 0.000003958656 
##           6.59           6.64           6.69            6.7           6.71 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##           6.73           6.84           6.85           6.87            6.9 
## 0.000003958656 0.000003958656 0.000007917312 0.000003958656 0.000003958656 
##           6.98           6.99              7           7.03           7.12 
## 0.000007917312 0.000653178207 0.000007917312 0.000003958656 0.000003958656 
##           7.14           7.23           7.49           7.53           7.55 
## 0.000003958656 0.000003958656 0.000253353971 0.000003958656 0.000003958656 
##           7.56           7.64           7.68           7.73           7.74 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000007917312 
##           7.85           7.89           7.99              8           8.34 
## 0.000003958656 0.000011875967 0.000676930142 0.000007917312 0.000003958656 
##           8.35           8.43           8.47           8.49           8.66 
## 0.000003958656 0.000003958656 0.000003958656 0.000134594297 0.000003958656 
##           8.69           8.75            8.8            8.9           8.92 
## 0.000003958656 0.000003958656 0.000007917312 0.000011875967 0.000003958656 
##           8.99              9           9.13           9.17           9.49 
## 0.000312733808 0.000003958656 0.000003958656 0.000003958656 0.000186056823 
##           9.71           9.76           9.79            9.8            9.9 
## 0.000003958656 0.000003958656 0.000003958656 0.000007917312 0.000011875967 
##           9.95           9.99             10          10.14          10.15 
## 0.000015834623 0.001053002442 0.000015834623 0.000003958656 0.000003958656 
##          10.25          10.49           10.7          10.75          10.99 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000178139511 
##             11          11.07           11.1          11.41          11.99 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000174180855 
##             12          12.49          12.62          12.93          12.99 
## 0.000011875967 0.000011875967 0.000003958656 0.000003958656 0.000193974134 
##          13.37          13.46          13.48          13.52          13.61 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##          13.99          14.01           14.5          14.73          14.93 
## 0.000106883707 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##          14.99             15          15.01          15.68          15.99 
## 0.000261271283 0.000007917312 0.000003958656 0.000003958656 0.000114801018 
##          16.17          16.99          17.92          17.98          17.99 
## 0.000003958656 0.000067297149 0.000003958656 0.000003958656 0.000071255804 
##             18           18.6          18.99          19.01          19.49 
## 0.000003958656 0.000003958656 0.000059379837 0.000003958656 0.000003958656 
##           19.5           19.7           19.9          19.95          19.98 
## 0.000003958656 0.000003958656 0.000007917312 0.000003958656 0.000003958656 
##          19.99          20.99             21          21.99          22.22 
## 0.000197932790 0.000039586558 0.000003958656 0.000055421181 0.000003958656 
##          22.99          23.32          23.45          23.92          23.99 
## 0.000055421181 0.000003958656 0.000003958656 0.000003958656 0.000023751935 
##          24.46          24.64          24.95          24.99             25 
## 0.000003958656 0.000003958656 0.000003958656 0.000174180855 0.000003958656 
##          25.99           26.9          26.99          27.01          27.47 
## 0.000003958656 0.000003958656 0.000031669246 0.000003958656 0.000003958656 
##           27.5          27.99          28.99          29.95          29.99 
## 0.000007917312 0.000039586558 0.000027710591 0.000007917312 0.000150428920 
##          30.99          31.99          32.99          33.99             34 
## 0.000003958656 0.000007917312 0.000039586558 0.000011875967 0.000003958656 
##          34.99             35          35.99          37.84          37.99 
## 0.000023751935 0.000003958656 0.000023751935 0.000003958656 0.000015834623 
##          38.99           39.8          39.99          40.99          42.99 
## 0.000031669246 0.000003958656 0.000023751935 0.000007917312 0.000003958656 
##          43.99          44.99          45.99          46.99             49 
## 0.000003958656 0.000003958656 0.000003958656 0.000007917312 0.000003958656 
##          49.95          49.99             50          52.25          53.47 
## 0.000003958656 0.000031669246 0.000003958656 0.000003958656 0.000003958656 
##          54.99          59.99          64.99             69          69.99 
## 0.000047503870 0.000007917312 0.000015834623 0.000003958656 0.000027710591 
##          74.99          79.99          81.18          84.99          89.99 
## 0.000011875967 0.000039586558 0.000003958656 0.000007917312 0.000003958656 
##          94.99             99           99.9          99.95          99.99 
## 0.000007917312 0.000003958656 0.000007917312 0.000003958656 0.000003958656 
##            100         104.99         109.99         114.99         119.99 
## 0.000003958656 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##         124.99         129.99         134.99            140         184.99 
## 0.000007917312 0.000007917312 0.000003958656 0.000003958656 0.000007917312 
##         199.99         234.99         244.99         289.99          299.9 
## 0.000011875967 0.000003958656 0.000003958656 0.000003958656 0.000003958656 
##         299.99         309.99         369.99         374.99         379.99 
## 0.000007917312 0.000003958656 0.000007917312 0.000003958656 0.000007917312 
##         389.99         399.99 
## 0.000003958656 0.000019793279
g_apps9$Price <- as.character(g_apps9$Price)
g_apps9$Price[g_apps9$Price %in% "0"] <- "Free"
g_apps9$Price[g_apps9$Price != "Free"] <- "Paid"
ptab <- table(g_apps9$Price)
pptab <- prop.table(ptab)
pptab
## 
##       Free       Paid 
## 0.95587682 0.04412318
g_apps9$Price <- as.factor(g_apps9$Price)
summary(g_apps9$Price) # 4.4% of apps cost money
##   Free   Paid 
## 241465  11146
barplot(ptab, main = "Price of Apps", col = c("#00DCFF", "#F83648"), ylab = "Number of apps", xlab = "Type of App", ylim = c(0,250000))
legend("topright", fill = c("#00DCFF", "#F83648"), legend = levels(g_apps9$Price))
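
The Free/Paid split can also be made from the numeric prices in one step. A minimal sketch on toy values; `toy_price` and `price_type` are hypothetical names.

```r
# Toy prices in dollars (illustrative values only)
toy_price <- c(0, 0.99, 0, 399.99, 0)

# Compare numerically rather than matching the string "0"
price_type <- factor(ifelse(toy_price == 0, "Free", "Paid"), levels = c("Free", "Paid"))

# Share of each type, without printing every distinct price
round(prop.table(table(price_type)), 2)
```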

g_apps10 <- g_apps9
summary(g_apps10$Content.Rating) 
##           $0.99           $2.49               0        100,000+ 
##               0               0               0               0 
##             17M            3702 Adults only 18+        Everyone 
##               0               0              11          228790 
##    Everyone 10+      Mature 17+            Teen         Unrated 
##            4337            3234           16206              33
g_apps10$Content.Rating <- as.factor(g_apps10$Content.Rating)
crtab <- table(g_apps10$Content.Rating)
pcrtab <- prop.table(crtab)
pcrtab 
## 
##           $0.99           $2.49               0        100,000+ 
##   0.00000000000   0.00000000000   0.00000000000   0.00000000000 
##             17M            3702 Adults only 18+        Everyone 
##   0.00000000000   0.00000000000   0.00004354521   0.90570086022 
##    Everyone 10+      Mature 17+            Teen         Unrated 
##   0.01716869020   0.01280229285   0.06415397588   0.00013063564
pie3D(crtab, explode = 0.1, main = "App Content Ratings", theta = 1.5,  labels = levels(g_apps10$Content.Rating), col = c("#00DCFF", "#F83648", "#FFD800", "#04F075", "#444444"))

# The groups are very disproportionate, so let's group some together
g_apps10$Content.Rating <- as.character(g_apps10$Content.Rating)
g_apps10$Content.Rating[g_apps10$Content.Rating == "Everyone 10+"] <- "Mature"
g_apps10$Content.Rating[g_apps10$Content.Rating == "Adults only 18+"] <- "Mature"
g_apps10$Content.Rating[g_apps10$Content.Rating == "Teen"] <- "Mature"
g_apps10$Content.Rating[g_apps10$Content.Rating == "Unrated"] <- "Mature"
g_apps10$Content.Rating[g_apps10$Content.Rating == "Mature 17+"] <- "Mature"
g_apps10$Content.Rating <- as.factor(g_apps10$Content.Rating)
summary(g_apps10$Content.Rating)
## Everyone   Mature 
##   228790    23821
new_crtab <- table(g_apps10$Content.Rating)
new_pcrtab <- prop.table(new_crtab)
new_pcrtab
## 
##   Everyone     Mature 
## 0.90570086 0.09429914
pie3D(new_pcrtab, explode = 0.1, main = "App Content Ratings", theta = 1.5, col = c("#00DCFF", "#F83648"))
legend("topright", legend = levels(g_apps10$Content.Rating), fill = c("#00DCFF", "#F83648"))
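
The first `summary()` of Content.Rating above lists several levels with zero observations (for example "$0.99"). A minimal sketch, on toy data, of dropping such empty levels and merging the remaining ones in a single step; `toy_rating` is a hypothetical name.

```r
# Toy factor with one stray, unused level (illustrative only)
toy_rating <- factor(
  c("Everyone", "Teen", "Everyone 10+", "Mature 17+", "Everyone", "Unrated"),
  levels = c("$0.99", "Everyone", "Everyone 10+", "Mature 17+", "Teen", "Unrated")
)

# Drop levels that have no observations (here the stray "$0.99")
toy_rating <- droplevels(toy_rating)

# Assigning a named list to levels() merges old levels into new ones
levels(toy_rating) <- list(
  Everyone = "Everyone",
  Mature   = c("Everyone 10+", "Mature 17+", "Teen", "Unrated")
)

table(toy_rating)
```

Dropping the empty levels first also keeps them out of plot labels and legends.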

g_apps11 <- g_apps10[,-c(10,11)] # Removed last and minimum versions because I won't be using those variables

Bivariate Analysis

table(is.na(g_apps11$Rating)) # No missing values
## 
##  FALSE 
## 252611
plot(g_apps11$Rating~g_apps11$Reviews, col = "black", main="Reviews vs App Ratings", 
     xlab = "Log Reviews", 
     ylab = "Ratings", pch=20,
     xlim = c(0, 25))
abline(lm(g_apps11$Rating~g_apps11$Reviews), col="#F83648", lwd=2.5)
lines(lowess(g_apps11$Rating~g_apps11$Reviews), col="#00DCFF", lwd=2.5)

library(corrplot)
cor(g_apps11$Rating, g_apps11$Reviews) # Very weak negative correlation
## [1] -0.1726915
cormat <- cor(g_apps11[,c(3,4)])
corrplot(cormat, method = "circle", addCoef.col = "red")

cor.test(g_apps11$Rating, g_apps11$Reviews)
## 
##  Pearson's product-moment correlation
## 
## data:  g_apps11$Rating and g_apps11$Reviews
## t = -88.119, df = 252609, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1764723 -0.1689057
## sample estimates:
##        cor 
## -0.1726915

The results of the correlation test show that we can reject the null hypothesis of zero correlation. There is a correlation between Ratings and (log) Reviews; however, it is a very weak negative one at -0.17.

This also tells me that I can include both ratings and reviews in my regression model since multicollinearity is not an issue here.
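
To see why such a weak correlation still produces a tiny p-value, the t statistic in the test output above can be reproduced from the reported correlation and degrees of freedom alone. This is a small worked check using the standard relation t = r * sqrt(df) / sqrt(1 - r^2); `r`, `df` and `t_stat` are names chosen for illustration.

```r
r  <- -0.1726915   # correlation reported by cor.test above
df <- 252609       # degrees of freedom reported above (n - 2)

# Standard t statistic for testing a Pearson correlation against zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 2)

# Share of variance in one variable linearly explained by the other
round(r^2, 3)
```

With roughly 250,000 rows, even a correlation that explains about 3% of the variance is far from zero in statistical terms, so the size of the coefficient is more informative than the p-value here.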

Now I’ll look at how each variable affects the target variable, Installs.

install_size <- table(g_apps11$Size.Groups, g_apps11$Installs)
pinstall_size <- prop.table(install_size)
pinstall_size <- round(pinstall_size,2)
addmargins(pinstall_size,c(1,2))
##                     
##                      0-500 1000-10000 10000-100000 100000-1000000 1000000+
##   Varies with device  0.00       0.01         0.01           0.01     0.01
##   0-4M                0.04       0.07         0.08           0.04     0.01
##   4M-8.3M             0.04       0.07         0.08           0.04     0.01
##   8.3M-19M            0.04       0.07         0.08           0.04     0.01
##   19M-334M            0.03       0.06         0.07           0.05     0.02
##   Sum                 0.15       0.28         0.32           0.18     0.06
##                     
##                       Sum
##   Varies with device 0.04
##   0-4M               0.24
##   4M-8.3M            0.24
##   8.3M-19M           0.24
##   19M-334M           0.23
##   Sum                0.99
ptab_size <- prop.table(install_size, margin = 2)
barplot(ptab_size, col = c("#00DCFF", "#F83648", "#FFD800", "#04F075", "#444444"), main = "Installs vs Size", xlab = "Number of Installs", ylab = "Proportion")
legend("topright", fill = c("#00DCFF", "#F83648", "#FFD800", "#04F075", "#444444"), legend = levels(g_apps11$Size.Groups))

chisq.test(pinstall_size)
## Warning in chisq.test(pinstall_size): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  pinstall_size
## X-squared = 0.043817, df = 16, p-value = 1

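The p-value of the chi-squared test is 1, well above 0.05, so we fail to reject the null hypothesis. Taken at face value, this suggests that app size has no effect on the number of installs an app receives, which could be plausible if many consumers do not look at app size before downloading. However, the test was run on pinstall_size, a table of rounded proportions that sums to about 1, rather than on the raw counts in install_size. chisq.test() expects counts, which is why the statistic is close to zero and R warns that the approximation may be incorrect. The result therefore says little about the relationship between size and installs, and the test should be repeated on install_size before drawing a conclusion.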
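The difference between testing counts and testing proportions can be seen in a small sketch. The data below is simulated, and the names size_group and installs are hypothetical; the only point is that the same table gives very different answers depending on which form is passed to chisq.test().

```r
# Simulated data only; group names and probabilities are illustrative assumptions.
set.seed(42)
n <- 5000
size_group <- sample(c("small", "medium", "large"), n, replace = TRUE)

# Make installs depend on size so there is a real association to detect
p_high <- c(small = 0.2, medium = 0.3, large = 0.4)[size_group]
installs <- ifelse(runif(n) < p_high, "high", "low")

counts <- table(size_group, installs)       # raw counts: what chisq.test() expects
props <- round(prop.table(counts), 2)       # rounded proportions, summing to about 1

test_counts <- chisq.test(counts)
test_props <- suppressWarnings(chisq.test(props))

c(counts = test_counts$p.value, proportions = test_props$p.value)
```

On the counts the association is detected, while on the proportions the statistic shrinks by roughly a factor of n and the p-value moves close to 1.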

g_apps11 %>% group_by(Installs) %>% summarise(avg = mean(Rating), median = median(Rating), std = sd(Rating))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
##   Installs         avg median   std
##   <fct>          <dbl>  <dbl> <dbl>
## 1 0-500           4.65   4.97 0.473
## 2 1000-10000      4.35   4.43 0.464
## 3 10000-100000    4.28   4.37 0.403
## 4 100000-1000000  4.26   4.32 0.356
## 5 1000000+        4.25   4.29 0.312
boxplot(Rating~Installs, data = g_apps11, main="Boxplot of Ratings by Installs", xlab = "Install Group",
        col = c("#00DCFF", "#F83648", "#FFD800", "#04F075", "#444444")) # Looks like apps with fewer installs have a higher rating on average

rat_install.aov <- aov(Rating~Installs, data = g_apps11)
summary(rat_install.aov)
##                 Df Sum Sq Mean Sq F value              Pr(>F)    
## Installs         4   4352  1087.9    6191 <0.0000000000000002 ***
## Residuals   252606  44389     0.2                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(rat_install.aov) # For comparison of levels
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Rating ~ Installs, data = g_apps11)
## 
## $Installs
##                                     diff         lwr          upr
## 1000-10000-0-500            -0.300102885 -0.30737361 -0.292832160
## 10000-100000-0-500          -0.371134212 -0.37825409 -0.364014336
## 100000-1000000-0-500        -0.390823127 -0.39873979 -0.382906467
## 1000000+-0-500              -0.395944779 -0.40665866 -0.385230895
## 10000-100000-1000-10000     -0.071031327 -0.07690862 -0.065154031
## 100000-1000000-1000-10000   -0.090720242 -0.09754105 -0.083899431
## 1000000+-1000-10000         -0.095841894 -0.10577352 -0.085910264
## 100000-1000000-10000-100000 -0.019688915 -0.02634869 -0.013029135
## 1000000+-10000-100000       -0.024810567 -0.03463230 -0.014988833
## 1000000+-100000-1000000     -0.005121652 -0.01553546  0.005292153
##                                 p adj
## 1000-10000-0-500            0.0000000
## 10000-100000-0-500          0.0000000
## 100000-1000000-0-500        0.0000000
## 1000000+-0-500              0.0000000
## 10000-100000-1000-10000     0.0000000
## 100000-1000000-1000-10000   0.0000000
## 1000000+-1000-10000         0.0000000
## 100000-1000000-10000-100000 0.0000000
## 1000000+-10000-100000       0.0000000
## 1000000+-100000-1000000     0.6651233
#verify ANOVA assumptions with diagnostic plots
plot(rat_install.aov)

Here we can see that apps with fewer installs have higher ratings on average. One plausible explanation is that apps with many installs also collect many ratings: each individual rating carries less weight, and a larger, more varied audience is likely to produce a wider range of ratings, pulling the average down over time. The Tukey comparisons show that every pair of install groups differs significantly except the two largest groups (1000000+ versus 100000-1000000, adjusted p-value 0.67).

The p-value for the ANOVA test is low enough to reject the null hypothesis that all install groups have the same mean rating. I also used diagnostic plots to check the ANOVA assumptions. The data does have some outliers, but in general the plots look good. More specifically, the Residuals vs. Fitted plot suggests homogeneity of variance, as the spread of the residuals is similar across groups, and the Normal Q-Q plot suggests that the residuals are approximately normally distributed.
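With a sample this large, almost any difference in means is significant, so an effect size is a useful complement to the p-value. As a rough sketch, eta squared can be computed directly from the sums of squares printed in the ANOVA table above; the variable names here are my own.

```r
# Sums of squares copied from the summary(rat_install.aov) output above
ss_installs <- 4352     # "Sum Sq" for Installs
ss_residual <- 44389    # "Sum Sq" for Residuals

# Eta squared: share of the total variation in Rating associated with install group
eta_squared <- ss_installs / (ss_installs + ss_residual)
round(eta_squared, 3)
```

The value of about 0.09 suggests that install group accounts for roughly 9% of the variation in ratings.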

g_apps11 %>% group_by(Installs) %>% summarise(avg = mean(Reviews), median = median(Reviews), std = sd(Reviews))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
##   Installs         avg median   std
##   <fct>          <dbl>  <dbl> <dbl>
## 1 0-500           1.49   1.39  1.08
## 2 1000-10000      3.17   3.18  1.14
## 3 10000-100000    5.19   5.15  1.15
## 4 100000-1000000  7.41   7.38  1.18
## 5 1000000+        9.91   9.94  1.16
boxplot(Reviews~Installs, data = g_apps11, main="Boxplot of Reviews by Installs",
        col = c("#00DCFF", "#F83648", "#FFD800", "#04F075", "#444444")) # Reviews rise steadily with each install group

rev_installs.aov <- aov(Reviews~Installs, data = g_apps11)
summary(rev_installs.aov)
##                 Df  Sum Sq Mean Sq F value              Pr(>F)    
## Installs         4 1357431  339358  260055 <0.0000000000000002 ***
## Residuals   252606  329638       1                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(rev_installs.aov) # For comparison of levels
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Reviews ~ Installs, data = g_apps11)
## 
## $Installs
##                                 diff      lwr      upr p adj
## 1000-10000-0-500            1.687458 1.667645 1.707272     0
## 10000-100000-0-500          3.700072 3.680670 3.719474     0
## 100000-1000000-0-500        5.921287 5.899714 5.942861     0
## 1000000+-0-500              8.424466 8.395270 8.453662     0
## 10000-100000-1000-10000     2.012613 1.996597 2.028629     0
## 100000-1000000-1000-10000   4.233829 4.215242 4.252416     0
## 1000000+-1000-10000         6.737008 6.709943 6.764072     0
## 100000-1000000-10000-100000 2.221216 2.203067 2.239364     0
## 1000000+-10000-100000       4.724394 4.697629 4.751159     0
## 1000000+-100000-1000000     2.503179 2.474800 2.531557     0
#verify ANOVA assumptions with diagnostic plots
plot(rev_installs.aov) # we have outliers

As expected, apps with over 1 million installs have the highest average number of reviews, and apps in the 0-500 install group have the lowest. The ANOVA p-value is low enough to reject the null hypothesis of equal group means, and every pairwise Tukey comparison is significant. Installs and Reviews have a positive relationship with each other. The diagnostic plots look good as well, apart from some outliers.

ptab <- table(g_apps11$Price, g_apps11$Installs)
pptab <- prop.table(ptab)
pptab <- round(pptab, 2)
addmargins(pptab, c(1,2))
##       
##        0-500 1000-10000 10000-100000 100000-1000000 1000000+  Sum
##   Free  0.13       0.27         0.31           0.18     0.06 0.95
##   Paid  0.02       0.01         0.01           0.00     0.00 0.04
##   Sum   0.15       0.28         0.32           0.18     0.06 0.99
ptab_price <- prop.table(ptab, margin = 2)
barplot(ptab_price, col = c("#00DCFF", "#F83648"), main = "Installs by Price Group", xlab = "Number of Installs", ylab = "Proportion of Apps")
legend("topright", legend = levels(g_apps11$Price), fill = c("#00DCFF", "#F83648")) # disproportionate so of course we expect this

chisq.test(pptab)
## Warning in chisq.test(pptab): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  pptab
## X-squared = 0.044366, df = 4, p-value = 0.9998

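The p-value for the chi-squared test is 0.9998, so we fail to reject the null hypothesis. This test was also run on a rounded proportion table (pptab) rather than on the counts in ptab, so the p-value is not informative. The disproportionate split between Free and Paid apps (about 95% versus 4%) further limits what can be concluded. I will not draw any conclusions about Installs compared to the Price of an app.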

crtab <- table(g_apps11$Content.Rating, g_apps11$Installs)
crtab <- prop.table(crtab)
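pcrtab <- round(pptab, 2) # NOTE: this rounds pptab (the Price table), not crtab, so the output below repeats the Free/Paid table; round(crtab, 2) was likely intended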
addmargins(pcrtab, c(1,2))
##       
##        0-500 1000-10000 10000-100000 100000-1000000 1000000+  Sum
##   Free  0.13       0.27         0.31           0.18     0.06 0.95
##   Paid  0.02       0.01         0.01           0.00     0.00 0.04
##   Sum   0.15       0.28         0.32           0.18     0.06 0.99
ptab_cr <- prop.table(crtab, margin = 2)
barplot(ptab_cr, col = c("#00DCFF", "#F83648"), main = "Installs by Content rating", xlab = "Number of Installs", ylab = "Proportion of Apps")
legend("topright", legend = levels(g_apps11$Content.Rating), fill = c("#00DCFF", "#F83648")) # disproportionate so of course we expect this

chisq.test(pcrtab)
## Warning in chisq.test(pcrtab): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  pcrtab
## X-squared = 0.044366, df = 4, p-value = 0.9998

The p-value for the chi-squared test is again 0.9998, so we fail to reject the null hypothesis. Note, however, that pcrtab was built from pptab, so the table and test above repeat the Price results rather than describing Content Rating. I will not draw any conclusions about Installs compared to the Content Rating of an app.

cattab <- table(g_apps11$Category, g_apps11$Installs)
pcattab <- prop.table(cattab)
pcattab <- round(pcattab, 2)
addmargins(pcattab, c(1,2))
##                
##                 0-500 1000-10000 10000-100000 100000-1000000 1000000+  Sum
##   EDUCATION      0.05       0.09         0.10           0.04     0.01 0.29
##   ENTERTAINMENT  0.04       0.06         0.08           0.05     0.03 0.26
##   LIFESTYLE      0.04       0.07         0.07           0.04     0.01 0.23
##   PRODUCTIVITY   0.03       0.06         0.07           0.05     0.02 0.23
##   Sum            0.16       0.28         0.32           0.18     0.07 1.01
ptab_cat <- prop.table(cattab, margin = 2)
barplot(ptab_cat, col = c("#00DCFF", "#F83648", "#FFD800", "#04F075"), main = "Installs by Category", xlab = "Number of Installs", ylab = "Proportion of Apps")
legend("topright", legend = levels(g_apps11$Category), fill = c("#00DCFF", "#F83648", "#FFD800", "#04F075")) # disproportionate so of course we expect this

chisq.test(pcattab)
## Warning in chisq.test(pcattab): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  pcattab
## X-squared = 0.028116, df = 12, p-value = 1

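The p-value for the chi-squared test is 1, so we fail to reject the null hypothesis. As with the earlier chi-squared tests, this one was run on rounded proportions (pcattab) rather than on the counts in cattab, so it cannot tell us whether Category affects the number of Installs an app receives.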

Hypothesis Testing

# Hypothesis 1: Education apps will have higher ratings than all other apps on average.

g_apps11 %>% group_by(Category) %>% summarise(avg = mean(Rating), median = median(Rating), std = sd(Rating))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 4
##   Category        avg median   std
##   <fct>         <dbl>  <dbl> <dbl>
## 1 EDUCATION      4.39   4.47 0.440
## 2 ENTERTAINMENT  4.37   4.42 0.412
## 3 LIFESTYLE      4.34   4.40 0.467
## 4 PRODUCTIVITY   4.28   4.33 0.431
boxplot(Rating~Category, data = g_apps11, main="Boxplot of Ratings by Category",
        col = c("#00DCFF", "#F83648", "#FFD800", "#04F075")) # Looks fairly equal; Productivity has a slightly lower rating.

rat_category.aov <- aov(Rating~Category, data = g_apps11)
summary(rat_category.aov)
##                 Df Sum Sq Mean Sq F value              Pr(>F)    
## Category         3    477  159.01   832.3 <0.0000000000000002 ***
## Residuals   252607  48264    0.19                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(rat_category.aov) # For comparison of levels
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Rating ~ Category, data = g_apps11)
## 
## $Category
##                                   diff         lwr         upr p adj
## ENTERTAINMENT-EDUCATION    -0.02201544 -0.02811043 -0.01592046     0
## LIFESTYLE-EDUCATION        -0.05476119 -0.06104986 -0.04847251     0
## PRODUCTIVITY-EDUCATION     -0.11474899 -0.12094437 -0.10855361     0
## LIFESTYLE-ENTERTAINMENT    -0.03274574 -0.03924470 -0.02624679     0
## PRODUCTIVITY-ENTERTAINMENT -0.09273355 -0.09914227 -0.08632482     0
## PRODUCTIVITY-LIFESTYLE     -0.05998780 -0.06658101 -0.05339460     0
#verify ANOVA assumptions with diagnostic plots
plot(rat_category.aov) # we have outliers

The results show that we can reject the null hypothesis of equal mean ratings across categories. The Tukey pairwise comparisons show a significant difference between every pair of categories, and Education apps do have the highest average rating. Productivity apps have the lowest average rating of the four groups. The differences are small in absolute terms, however: the largest gap, between Education and Productivity, is about 0.11 rating points.

I also used diagnostic plots to check the ANOVA assumptions. The data does have some outliers, but in general the plots look good. More specifically, the Residuals vs. Fitted plot suggests homogeneity of variance, as the spread of the residuals is similar across groups, and the Normal Q-Q plot suggests that the residuals are approximately normally distributed.

# Hypothesis 2: Education apps will have a higher number of reviews than Lifestyle apps
life_reviews <- mean(g_apps11$Reviews[g_apps11$Category=="LIFESTYLE"])
RC_Hypo <- t.test(g_apps11$Reviews[g_apps11$Category=="EDUCATION"],
                 alternative="greater", 
                 mu=life_reviews, 
                 conf.level=0.95)
RC_Hypo
## 
##  One Sample t-test
## 
## data:  g_apps11$Reviews[g_apps11$Category == "EDUCATION"]
## t = -18.972, df = 73186, p-value = 1
## alternative hypothesis: true mean is greater than 4.535357
## 95 percent confidence interval:
##  4.358875      Inf
## sample estimates:
## mean of x 
##  4.372955
# Let's plot the difference
Edu <- g_apps11$Reviews[g_apps11$Category =="EDUCATION"]
Life <- g_apps11$Reviews[g_apps11$Category == "LIFESTYLE"]
plot(density(Edu), main = "Difference between Education & Lifestyle Apps", lwd= 3, col= "#00DCFF", xlab = "Log of Reviews")
lines(density(Life), col="#F83648", lwd=3) 
legend("topright", c("Education", "Lifestyle"), col = c("#00DCFF", "#F83648"), pch = c(19,19), cex = 0.8)

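I fail to reject the null hypothesis for this test, as the p-value is equal to 1. In fact, the sample mean for Education apps (4.37) is below the Lifestyle mean (4.54), the opposite of the hypothesised direction. Note that this one-sample test treats the Lifestyle mean as a fixed, known value; a two-sample t-test would also account for the sampling variability of the Lifestyle group.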
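For comparing two groups, a two-sample (Welch) t-test is the more usual choice, because it accounts for the variability in both samples. The sketch below uses simulated values rather than the app data; edu_reviews and life_reviews_sim are hypothetical vectors, with means and spread chosen only to be loosely similar to the summaries above.

```r
# Simulated data only; these vectors are stand-ins, not subsets of g_apps11.
set.seed(7)
edu_reviews <- rnorm(2000, mean = 4.4, sd = 2.5)
life_reviews_sim <- rnorm(1800, mean = 4.5, sd = 2.5)

# Welch two-sample t-test: is the Education mean greater than the Lifestyle mean?
# t.test() uses var.equal = FALSE by default, which gives the Welch version.
welch <- t.test(edu_reviews, life_reviews_sim, alternative = "greater")

welch$method
welch$p.value
```

On the real data the same call would take the Edu and Life vectors defined in the plotting code above as its first two arguments.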

# Hypothesis 3: On average, Paid apps will be rated higher than free apps. 

g_apps12 <- g_apps11
g_apps12 %>% group_by(Price) %>% summarise(avg = mean(Rating), median = median(Rating), std = sd(Rating))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   Price   avg median   std
##   <fct> <dbl>  <dbl> <dbl>
## 1 Free   4.35   4.40 0.440
## 2 Paid   4.39   4.45 0.420
boxplot(Rating~Price, data = g_apps12, main="Boxplot of Ratings by Price",
        col = c("#00DCFF", "#F83648")) #Very similar results

rat_price.aov <- aov(Rating~Price, data=g_apps11)
rat_price.aov
## Call:
##    aov(formula = Rating ~ Price, data = g_apps11)
## 
## Terms:
##                    Price Residuals
## Sum of Squares     16.65  48724.43
## Deg. of Freedom        1    252609
## 
## Residual standard error: 0.4391865
## Estimated effects may be unbalanced
summary(rat_price.aov)
##                 Df Sum Sq Mean Sq F value              Pr(>F)    
## Price            1     17  16.651   86.33 <0.0000000000000002 ***
## Residuals   252609  48724   0.193                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(rat_price.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Rating ~ Price, data = g_apps11)
## 
## $Price
##                 diff        lwr        upr p adj
## Paid-Free 0.03953342 0.03119399 0.04787285     0
plot(rat_price.aov)

I reject the null hypothesis for this test. There is a statistically significant difference between the ratings of free and paid apps, and paid apps have a higher average rating. The difference is small, however, at about 0.04 rating points. Diagnostic plots suggest that the ANOVA results are valid.
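To put the size of this difference in context, a standardized effect size can be approximated from the numbers already printed above. This is a rough sketch that uses the residual standard error as the pooled standard deviation; the variable names are my own.

```r
# Values copied from the output above
mean_diff <- 0.03953342      # Paid minus Free, from the TukeyHSD output
residual_sd <- 0.4391865     # residual standard error from the aov printout

# Approximate Cohen's d: mean difference in units of the pooled standard deviation
cohens_d <- mean_diff / residual_sd
round(cohens_d, 2)
```

A value of about 0.09 is well below the conventional 0.2 cut-off for a small effect, so the difference is statistically detectable but practically minor.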

# Hypothesis 4: Apps Rated Mature will have more reviews than apps for Everyone
g_apps12 %>% group_by(Content.Rating) %>% summarise(avg = mean(Reviews), median = median(Reviews), std = sd(Reviews))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   Content.Rating   avg median   std
##   <fct>          <dbl>  <dbl> <dbl>
## 1 Everyone        4.70   4.52  2.54
## 2 Mature          5.48   5.30  2.87
boxplot(Reviews~Content.Rating, data = g_apps12, main = "Boxplot of # of Reviews by Content Rating", 
        col=c("#00DCFF", "#F83648")) #Looks like mature apps get more reviews on average

rev_content.aov <- aov(Reviews~Content.Rating, data=g_apps11)
summary(rev_content.aov)
##                    Df  Sum Sq Mean Sq F value              Pr(>F)    
## Content.Rating      1   13112   13112    1979 <0.0000000000000002 ***
## Residuals      252609 1673956       7                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(rev_content.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Reviews ~ Content.Rating, data = g_apps11)
## 
## $Content.Rating
##                     diff       lwr       upr p adj
## Mature-Everyone 0.779582 0.7452323 0.8139318     0
plot(rev_content.aov)

I reject the null hypothesis for this test. Mature apps have more reviews on average than apps for Everyone.

# Hypothesis 5: Apps with higher number of Installs will have a higher number of reviews

g_apps12$Installs <- as.character(g_apps12$Installs)
g_apps12$InstallCat <- rep(NA, nrow(g_apps12))
g_apps12$InstallCat[g_apps12$Installs %in% c("100000-1000000","1000000+")] <- "High"
g_apps12$InstallCat[g_apps12$Installs %in% c("10000-100000","1000-10000","0-500")] <- "Low"
g_apps12$InstallCat <- as.factor(g_apps12$InstallCat)
g_apps12$Installs <- as.factor(g_apps12$Installs)
summary(g_apps12$InstallCat)
##   High    Low 
##  62708 189903
low_mean <- mean(g_apps12$Reviews[g_apps12$InstallCat == "Low"])

RI_Hypo <- t.test(g_apps12$Reviews[g_apps12$InstallCat=="High"],
                 alternative="greater", 
                 mu=low_mean, 
                 conf.level=0.95)
RI_Hypo
## 
##  One Sample t-test
## 
## data:  g_apps12$Reviews[g_apps12$InstallCat == "High"]
## t = 678.53, df = 62707, p-value < 0.00000000000000022
## alternative hypothesis: true mean is greater than 3.693875
## 95 percent confidence interval:
##  8.047836      Inf
## sample estimates:
## mean of x 
##  8.058416
# Let's plot the difference
High_Installs <- g_apps12$Reviews[g_apps12$InstallCat =="High"]
Low_Installs <- g_apps12$Reviews[g_apps12$InstallCat == "Low"]
plot(density(High_Installs), main = "Number of Reviews for High & Low Apps", lwd= 3, col= "#00DCFF", xlab = "Log of Reviews")
lines(density(Low_Installs), col="#F83648", lwd=3) 
legend("topright", c("High Installs", "Low Installs"), col = c("#00DCFF", "#F83648"), pch = c(19,19), cex = 0.8)

Test results show that we can reject the null hypothesis. Apps with higher installs receive more reviews on average.

I will also use this new category in a logistic regression to determine which variables best predict whether an app will receive a high or low number of downloads.

Regression

options(scipen = 99)
g_apps13 <- g_apps12

g_mod0 <- glm(InstallCat~1, data = g_apps13, family = "binomial")
g_mod <- glm(InstallCat ~ Rating, data=g_apps13, family = "binomial")
summary(g_mod)$coef
##               Estimate Std. Error   z value      Pr(>|z|)
## (Intercept) -1.5498513 0.04449110 -34.83509 7.161995e-266
## Rating       0.6151776 0.01030832  59.67777  0.000000e+00
1-logLik(g_mod)/logLik(g_mod0) # McFadden's pseudo R-squared: the proportional improvement in log-likelihood over the intercept-only model
## 'log Lik.' 0.01253825 (df=2)

I start with a simple logistic regression to determine whether ratings alone are a good predictor of installs. Because InstallCat is a factor with levels High and Low, glm() treats the first level (High) as the reference and models the log-odds of an app being in the Low group. The positive coefficient for Rating therefore means that as an app's rating increases, the log-odds of it falling in the Low install group increase, which is consistent with the earlier finding that apps with fewer installs have higher average ratings. The result is statistically significant, with a p-value less than 0.05.

Using McFadden's pseudo R-squared for logistic regression, I see that the simple model improves on the intercept-only model by only about 1%. I will need a multiple logistic regression model to improve the fit.
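How glm() codes a factor response determines the sign of the coefficients, so it is worth checking on a small example. The sketch below uses simulated data with hypothetical names (rating, install_cat); it is built so that higher ratings make the "Low" level more likely, and then confirms which level the model treats as the outcome.

```r
# Simulated data only; names and coefficients are illustrative assumptions.
set.seed(3)
n <- 4000
rating <- runif(n, 1, 5)

# By construction, higher ratings make "Low" more likely
p_low <- plogis(-1.5 + 0.6 * rating)
install_cat <- factor(ifelse(runif(n) < p_low, "Low", "High"))

levels(install_cat)   # "High" comes first, so it is the reference ("failure") level

# With a factor response, glm() models the probability of NOT being in the first level
fit <- glm(install_cat ~ rating, family = "binomial")
slope <- unname(coef(fit)["rating"])

slope        # positive: a higher rating raises the odds of "Low"
exp(slope)   # odds ratio for a one-point increase in rating
```

If the High group should be the modelled outcome instead, the reference level can be changed with relevel() before fitting.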

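null <- glm(Installs ~ 1, data = g_apps13, family = "binomial") # intercept-only model; the response is the five-level Installs factor (not InstallCat), so glm() models "0-500" versus all other groups
full <- glm(Installs ~ Category + Reviews + Rating + Size.Groups + Price + Content.Rating, data= g_apps13, family = "binomial") # full model with all candidate predictors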
step(null, scope=list(lower=null, upper=full), direction="forward") # forward selection
## Start:  AIC=213565.1
## Installs ~ 1
## 
##                  Df Deviance    AIC
## + Reviews         1   100229 100233
## + Rating          1   188315 188319
## + Price           1   210157 210161
## + Size.Groups     4   211995 212005
## + Category        3   213197 213205
## + Content.Rating  1   213366 213370
## <none>                213563 213565
## 
## Step:  AIC=100233.4
## Installs ~ Reviews
## 
##                  Df Deviance    AIC
## + Price           1    93444  93450
## + Rating          1    96644  96650
## + Category        3   100083 100093
## + Size.Groups     4   100208 100220
## <none>                100229 100233
## + Content.Rating  1   100229 100235
## 
## Step:  AIC=93450.38
## Installs ~ Reviews + Price
## 
##                  Df Deviance   AIC
## + Rating          1    89612 89620
## + Category        3    93357 93369
## + Size.Groups     4    93392 93406
## <none>                 93444 93450
## + Content.Rating  1    93444 93452
## 
## Step:  AIC=89619.95
## Installs ~ Reviews + Price + Rating
## 
##                  Df Deviance   AIC
## + Category        3    89480 89494
## + Size.Groups     4    89576 89592
## <none>                 89612 89620
## + Content.Rating  1    89611 89621
## 
## Step:  AIC=89493.79
## Installs ~ Reviews + Price + Rating + Category
## 
##                  Df Deviance   AIC
## + Size.Groups     4    89444 89466
## + Content.Rating  1    89476 89492
## <none>                 89480 89494
## 
## Step:  AIC=89465.86
## Installs ~ Reviews + Price + Rating + Category + Size.Groups
## 
##                  Df Deviance   AIC
## + Content.Rating  1    89439 89463
## <none>                 89444 89466
## 
## Step:  AIC=89463.23
## Installs ~ Reviews + Price + Rating + Category + Size.Groups + 
##     Content.Rating
## 
## Call:  glm(formula = Installs ~ Reviews + Price + Rating + Category + 
##     Size.Groups + Content.Rating, family = "binomial", data = g_apps13)
## 
## Coefficients:
##           (Intercept)                Reviews              PricePaid  
##               2.57764                1.59061               -2.98005  
##                Rating  CategoryENTERTAINMENT      CategoryLIFESTYLE  
##              -1.09773               -0.21786               -0.17365  
##  CategoryPRODUCTIVITY        Size.Groups0-4M     Size.Groups4M-8.3M  
##              -0.24313               -0.25197               -0.34222  
##   Size.Groups8.3M-19M    Size.Groups19M-334M   Content.RatingMature  
##              -0.30781               -0.29209                0.06937  
## 
## Degrees of Freedom: 252610 Total (i.e. Null);  252599 Residual
## Null Deviance:       213600 
## Residual Deviance: 89440     AIC: 89460
step(full, data=g_apps13, direction="backward") # backward selection
## Start:  AIC=89463.23
## Installs ~ Category + Reviews + Rating + Size.Groups + Price + 
##     Content.Rating
## 
##                  Df Deviance    AIC
## <none>                 89439  89463
## - Content.Rating  1    89444  89466
## - Size.Groups     4    89476  89492
## - Category        3    89574  89592
## - Rating          1    93305  93327
## - Price           1    96382  96404
## - Reviews         1   182831 182853
## 
## Call:  glm(formula = Installs ~ Category + Reviews + Rating + Size.Groups + 
##     Price + Content.Rating, family = "binomial", data = g_apps13)
## 
## Coefficients:
##           (Intercept)  CategoryENTERTAINMENT      CategoryLIFESTYLE  
##               2.57764               -0.21786               -0.17365  
##  CategoryPRODUCTIVITY                Reviews                 Rating  
##              -0.24313                1.59061               -1.09773  
##       Size.Groups0-4M     Size.Groups4M-8.3M    Size.Groups8.3M-19M  
##              -0.25197               -0.34222               -0.30781  
##   Size.Groups19M-334M              PricePaid   Content.RatingMature  
##              -0.29209               -2.98005                0.06937  
## 
## Degrees of Freedom: 252610 Total (i.e. Null);  252599 Residual
## Null Deviance:       213600 
## Residual Deviance: 89440     AIC: 89460
step(null, scope = list(upper=full), data=g_apps13, direction="both") # Stepwise selection
## Start:  AIC=213565.1
## Installs ~ 1
## 
##                  Df Deviance    AIC
## + Reviews         1   100229 100233
## + Rating          1   188315 188319
## + Price           1   210157 210161
## + Size.Groups     4   211995 212005
## + Category        3   213197 213205
## + Content.Rating  1   213366 213370
## <none>                213563 213565
## 
## Step:  AIC=100233.4
## Installs ~ Reviews
## 
##                  Df Deviance    AIC
## + Price           1    93444  93450
## + Rating          1    96644  96650
## + Category        3   100083 100093
## + Size.Groups     4   100208 100220
## <none>                100229 100233
## + Content.Rating  1   100229 100235
## - Reviews         1   213563 213565
## 
## Step:  AIC=93450.38
## Installs ~ Reviews + Price
## 
##                  Df Deviance    AIC
## + Rating          1    89612  89620
## + Category        3    93357  93369
## + Size.Groups     4    93392  93406
## <none>                 93444  93450
## + Content.Rating  1    93444  93452
## - Price           1   100229 100233
## - Reviews         1   210157 210161
## 
## Step:  AIC=89619.95
## Installs ~ Reviews + Price + Rating
## 
##                  Df Deviance    AIC
## + Category        3    89480  89494
## + Size.Groups     4    89576  89592
## <none>                 89612  89620
## + Content.Rating  1    89611  89621
## - Rating          1    93444  93450
## - Price           1    96644  96650
## - Reviews         1   184785 184791
## 
## Step:  AIC=89493.79
## Installs ~ Reviews + Price + Rating + Category
## 
##                  Df Deviance    AIC
## + Size.Groups     4    89444  89466
## + Content.Rating  1    89476  89492
## <none>                 89480  89494
## - Category        3    89612  89620
## - Rating          1    93357  93369
## - Price           1    96414  96426
## - Reviews         1   184493 184505
## 
## Step:  AIC=89465.86
## Installs ~ Reviews + Price + Rating + Category + Size.Groups
## 
##                  Df Deviance    AIC
## + Content.Rating  1    89439  89463
## <none>                 89444  89466
## - Size.Groups     4    89480  89494
## - Category        3    89576  89592
## - Rating          1    93307  93327
## - Price           1    96387  96407
## - Reviews         1   183080 183100
## 
## Step:  AIC=89463.23
## Installs ~ Reviews + Price + Rating + Category + Size.Groups + 
##     Content.Rating
## 
##                  Df Deviance    AIC
## <none>                 89439  89463
## - Content.Rating  1    89444  89466
## - Size.Groups     4    89476  89492
## - Category        3    89574  89592
## - Rating          1    93305  93327
## - Price           1    96382  96404
## - Reviews         1   182831 182853
## 
## Call:  glm(formula = Installs ~ Reviews + Price + Rating + Category + 
##     Size.Groups + Content.Rating, family = "binomial", data = g_apps13)
## 
## Coefficients:
##           (Intercept)                Reviews              PricePaid  
##               2.57764                1.59061               -2.98005  
##                Rating  CategoryENTERTAINMENT      CategoryLIFESTYLE  
##              -1.09773               -0.21786               -0.17365  
##  CategoryPRODUCTIVITY        Size.Groups0-4M     Size.Groups4M-8.3M  
##              -0.24313               -0.25197               -0.34222  
##   Size.Groups8.3M-19M    Size.Groups19M-334M   Content.RatingMature  
##              -0.30781               -0.29209                0.06937  
## 
## Degrees of Freedom: 252610 Total (i.e. Null);  252599 Residual
## Null Deviance:       213600 
## Residual Deviance: 89440     AIC: 89460
1-logLik(full)/logLik(null) # McFadden's pseudo R-squared: the proportional improvement in log-likelihood over the intercept-only model
## 'log Lik.' 0.5812048 (df=12)

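I used forward selection, backward elimination and stepwise selection to find the best model for predicting Installs. Each method returned the same model, with an AIC of 89463. Note that the response in these models is the five-level Installs factor rather than InstallCat. With a binomial family, glm() treats the first level (0-500) as the reference, so the model estimates the odds that an app falls in an install group above 0-500. The final model below uses all of the candidate variables, and McFadden's pseudo R-squared of about 0.58 indicates a large improvement in fit over the intercept-only model, although it should not be read as a literal share of variance explained.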
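As a quick check on the pseudo R-squared, the same value can be recovered from the deviances reported by the model. For a binary response fitted to individual observations the deviance equals -2 times the log-likelihood, so the ratio of deviances matches the ratio of log-likelihoods. The variable names below are my own.

```r
# Deviances copied from the model output
null_deviance <- 213563       # intercept-only model
residual_deviance <- 89439    # full model

# McFadden's pseudo R-squared expressed through deviances
mcfadden_r2 <- 1 - residual_deviance / null_deviance
round(mcfadden_r2, 3)
```

This reproduces the value of about 0.581 returned by the logLik() calculation.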

final <- glm(formula = Installs ~ Reviews + Price + Rating + Category + 
    Size.Groups + Content.Rating, family = "binomial", data = g_apps13)
summary(final)
## 
## Call:
## glm(formula = Installs ~ Reviews + Price + Rating + Category + 
##     Size.Groups + Content.Rating, family = "binomial", data = g_apps13)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4283   0.0069   0.0597   0.2439   3.5785  
## 
## Coefficients:
##                        Estimate Std. Error z value             Pr(>|z|)
## (Intercept)            2.577643   0.107844  23.902 < 0.0000000000000002
## Reviews                1.590614   0.008831 180.120 < 0.0000000000000002
## PricePaid             -2.980053   0.036327 -82.034 < 0.0000000000000002
## Rating                -1.097728   0.018374 -59.745 < 0.0000000000000002
## CategoryENTERTAINMENT -0.217857   0.023541  -9.254 < 0.0000000000000002
## CategoryLIFESTYLE     -0.173653   0.023423  -7.414    0.000000000000123
## CategoryPRODUCTIVITY  -0.243129   0.024475  -9.934 < 0.0000000000000002
## Size.Groups0-4M       -0.251971   0.066287  -3.801             0.000144
## Size.Groups4M-8.3M    -0.342221   0.066499  -5.146    0.000000265756169
## Size.Groups8.3M-19M   -0.307805   0.066705  -4.614    0.000003941567053
## Size.Groups19M-334M   -0.292086   0.066933  -4.364    0.000012779868712
## Content.RatingMature   0.069374   0.032315   2.147             0.031811
##                          
## (Intercept)           ***
## Reviews               ***
## PricePaid             ***
## Rating                ***
## CategoryENTERTAINMENT ***
## CategoryLIFESTYLE     ***
## CategoryPRODUCTIVITY  ***
## Size.Groups0-4M       ***
## Size.Groups4M-8.3M    ***
## Size.Groups8.3M-19M   ***
## Size.Groups19M-334M   ***
## Content.RatingMature  *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 213563  on 252610  degrees of freedom
## Residual deviance:  89439  on 252599  degrees of freedom
## AIC: 89463
## 
## Number of Fisher Scoring iterations: 8

As we can see in the final model, all variables included in the model are statistically significant, although Content Rating only at the 0.05 level. More reviews raise the odds that an app falls above the 0-500 install group, while a higher rating and being a paid app lower them. Held-out test data could be used to make predictions and measure the accuracy of the model, but that is beyond the scope of this project. For now, McFadden's pseudo R-squared for the model is about 0.58. I could also adjust the classification threshold away from 0.5 to see whether that improves accuracy.

In sum, all of the variables used in the final model contributed to predicting the number of installs an app receives. More data on these apps could help balance some of the disproportionate variables and produce more accurate predictions. App developers could also benefit from data on metrics that explain how consumers use these apps once they have been installed. Further analysis is needed, but this report is a solid starting point.
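Although evaluating on test data is outside the scope of this project, the general shape of such a check might look like the sketch below. It uses simulated data, and every name in it (sim, above_500, the 80/20 split, the coefficients) is an assumption made for illustration rather than part of the analysis above.

```r
# Simulated data only; column names and coefficients are hypothetical.
set.seed(11)
n <- 5000
sim <- data.frame(reviews = rnorm(n, 4.7, 2.5), rating = runif(n, 1, 5))
sim$above_500 <- rbinom(n, 1, plogis(-3 + 1.2 * sim$reviews - 0.5 * sim$rating))

# Hold out 20% of the rows for testing
train_rows <- sample(n, size = 0.8 * n)
train <- sim[train_rows, ]
test <- sim[-train_rows, ]

fit <- glm(above_500 ~ reviews + rating, data = train, family = "binomial")

# Predicted probabilities on the held-out rows, converted to classes at a threshold
prob <- predict(fit, newdata = test, type = "response")
threshold <- 0.5
pred <- as.integer(prob > threshold)

confusion <- table(actual = test$above_500, predicted = pred)
accuracy <- mean(pred == test$above_500)

confusion
accuracy
```

Changing threshold and recomputing the confusion table shows how the balance between false positives and false negatives shifts.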